CERN-TH/2003-250 



Solution of the Dirac equation in lattice QCD 

O ■ 

O ; using a domain decomposition method 

^— > . 

' Martin Liischer 

o . 

CERN, Theory Division 
\ CH-1211 Geneva 23, Switzerland 

> . 

QO , Abstract 

Efficient algorithms for the solution of partial differential equations on parallel computers 
are often based on domain decomposition methods. Schwarz preconditioners combined with 
standard Krylov space solvers are widely used in this context, and such a combination is 
shown here to perform very well in the case of the Wilson-Dirac equation in lattice QCD. In 
particular, with respect to even-odd preconditioned solvers, the communication overhead 
is significantly reduced, which allows the computational work to be distributed over a large 
number of processors with only small parallelization losses. 
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1. Introduction 

At present, numerical simulations of lattice QCD including sea-quark effects are still 
limited to relatively small lattices and large quark masses. The rapid increase in the 
computer power that is available for these calculations will certainly help to improve 
the situation, but it may also be necessary to develop better algorithms to be able to 
reach the chiral regime, where the effects associated with the spontaneous breaking 
of chiral symmetry become important. 

In this connection the applicability of domain decomposition methods [1,2] seems 
worth studying, given the fact that the standard formulations of lattice QCD involve 
nearest-neighbour interactions only. A first step in this direction was made in ref. [3], 
where a simulation algorithm for two-flavour lattice QCD was proposed, which can 
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be regarded as an implementation of the classic Schwarz alternating procedure [4] 
at the quantum level. 

The classic Schwarz procedure may also be used directly for the solution of the 
lattice Dirac equation, and the aim in the present paper is to show that this leads to 
a competitive solver, particularly so if the computational work is to be distributed 
over a large number of nodes of a parallel computer. While this result is immediately 
relevant to the QCD simulation algorithms that are currently in use, the study of the 
Schwarz procedure in this context also serves as a preparation for the implementation 
of the proposed Schwarz simulation algorithm [3]. 

For simplicity only the standard Wilson formulation of lattice QCD [5] will be con- 
sidered, but 0(a) improvement [6,7] can easily be included and the basic strategies 
are in any case expected to be more generally applicable. The Schwarz procedure 
for the solution of the lattice Dirac equation looks fairly natural in this framework 
and is explained in all detail. By itself this method is not very efficient, but it can 
be combined, in the form of a preconditioner for the Wilson-Dirac operator, with 
the GCR Krylov space accelerator. The excellent performance of the resulting algo- 
rithm is then demonstrated by a series of test runs on a recent PC-cluster with 64 
processors. 



2. Schwarz alternating procedure 

2.1 Lattice Dirac operator 

As usual the fields in lattice QCD are assumed to live on the sites x of a hypercubic 
four-dimensional lattice of finite extent. Periodic boundary conditions are imposed 
in all directions and the lattice spacing is set to unity for notational convenience. 

On the lattice the SU(3) gauge field is represented by group-valued link variables 
U(x,fi), ji = 0, ...,3, while the quark fields if)(x) carry Dirac and colour indices 
as in the continuum theory. The gauge-covariant forward and backward difference 
operators are then given by 



where jl denotes the unit vector in direction fi, and using these the Wilson-Dirac 



Vpipfr) = U(x, fi)ip(x + A) - ^0*0, 



(2.1) 



Vfy{x) = ip(x) - U(x - /}, n) 1 tp(x - jl), 



(2.2) 
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Fig. 1. Slice of a 6 4 block A (points inside the grey square) and its exterior boundary 
<9A (points outside the square) along a two-dimensional lattice plane. The white points 
in the square represent the interior boundary dA* of the block. 

operator may be written in the form 

D w = (V* + V„) - V*V M > ■ (2-3) 

A sum over fx is implied here and the Dirac matrices 7 M are taken to be hermitian. 
In the following, the Schwarz procedure will be set up for the linear equation 

Dil>{x) = ri(x), D = D w + m , (2.4) 

in which rj is an arbitrary source field and m the bare quark mass. 
2.2 Block terminology 

Like any other domain decomposition method, the Schwarz procedure operates on a 
covering of the lattice by a collection of domains. In the present context it is natural 
to choose the domains to be rectangular blocks of lattice points. Any such block is 
completely characterized by the lengths of its edges and by its position. In practice 
the blocks are usually taken to be fairly small, particularly on parallel computers, 
where for reasons of efficiency they should be contained in the local lattices, f 

The exterior boundary dA of a given block A is defined to be the set of all lattice 
points that have distance 1 from A (see fig. 1). Each exterior boundary point has a 
closest "partner" point in the block. The interior boundary dA* of A consists of all 
these points, and the set of points that are not in the block is denoted by A*. 

f Parallel programs in lattice QCD are usually based on a division of the lattice into sublattices 
that are associated to the processors of the computer or to different program threads. The term 
"local lattice" refers to these sublattices. 
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In position space the lattice Dirac operator is a sparse matrix that assumes the 
block-diagonal form 



if the lattice points are ordered so that those in A come first. The operator D A , for 
example, acts on quark fields on A in precisely the same way as D, except that all 
terms involving the exterior boundary points are set to zero (which is equivalent to 
imposing Dirichlet boundary conditions on OA). 

It is often convenient to consider D^, D^*, Dq\ and Dg^* to operate on quark 
fields on the whole lattice rather than on fields that are defined on A or A* only. 
The embedding is done in the obvious way by padding with zeros so that eq. (2.5), 
for example, may equivalently be written in the form 



This notation is perhaps slightly abusive but it will usually be clear from the context 
which interpretation applies. 

2.3 Definition of the Schwarz procedure 

In the form in which it was originally conceived [4] , the Schwarz procedure is assumed 
to operate on overlapping domains. The convergence of the method actually tends 
to be better the larger the overlaps are [1]. On the other hand, more domains are 
then required to cover the lattice, which slows down the computations, particularly 
in four dimensions, where the average occupation number is rapidly growing with the 
overlap size. If the Schwarz procedure is only used as a preconditioner (as will be the 
case here), it can thus be advantageous to choose the domains to be non-overlapping. 

The obvious choice is then to cover the lattice by a regular grid of non-overlapping 
rectangular blocks of equal size such as the one shown in fig. 2. Although this is not 
required for the Schwarz procedure, it will be assumed, for technical reasons, that 
the block sizes are even and that the grid can be chessboard-coloured. 

The so-called multiplicative Schwarz procedure now visits the blocks in the chosen 
block grid sequentially and updates the current approximation ip to the solution of 
the Wilson-Dirac equation (2.4) there. At the start of the iteration one can take 
ip = 0, for example. When the algorithm arrives at block A, the updated field ip' is 
obtained by solving the equations 




(2.5) 



D = D A + D A * + Dg\ + D dh * . 



(2.6) 



Dil)'{x) 



xeA 



rj(x) 



(2.7) 
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Fig. 2. Two-dimensional cross-section of a grid of non-overlapping 6 4 blocks that 
covers a lattice of size 24 x 12 3 . Such grids can be chessboard-coloured if there is an 
even number of blocks in all dimensions. 



It is straightforward to check that this amounts to setting 



Y/ = ^ + Dl 1 {r ] -D^) 



(2.8) 



where it is understood that acts on the restriction of the current residue r\ — Dip 
to the block A. This form of the iteration is numerically stable and is the one that 
is actually programmed. 

2.4 Schwarz iteration operator 

The structure of the Schwarz procedure becomes more transparent if it is rewritten 
in a slightly more compact form. To this end first note that the residue of the new 
approximation (2.8) coincides with the old residue, except on the block A and on the 
neighbouring blocks. So if all black and all white blocks are visited alternatingly, it 
is clear that the order of the blocks with the same colour is irrelevant and that they 
can in fact be updated simultaneously. 

Effectively the algorithm thus operates on only two domains, one being the union 
Q, of all black blocks and the other the union fT of all white blocks. With respect to 
this division of the lattice, the Dirac operator decomposes into a sum of operators 
Dq, Da*, Dgfi and Dqq*, as in the case of the division defined by a block A and 
its complement A*. Some algebra then shows that a complete Schwarz cycle, where 
the quark field is updated on all the black blocks followed by all the white blocks, 
amounts to the transformation 



Y> (1- KD)ip + Kr], 



(2.9) 



K = D^+D, 



1-1 



(2.10) 



The field thus evolves from ip = to 



(2.11) 



after n cy cycles, which shows that the Schwarz procedure attempts to obtain the 
solution in the form of a Neumann series with iteration operator 1 — DK. From this 
point of view, the operator K plays the role of a preconditioner for D. 



3. Preconditioned GCR algorithm 

Some numerical experimenting suggests that the spectral radius of the iteration oper- 
ator 1— DK is bounded by 1, at least as long as D is not exceptionally ill-conditioned. 
The rate of convergence of the Schwarz procedure is often disappointing, however, 
and it is thus advisable to combine the method with a Krylov space accelerator such 
as the generalized conjugate residual (GCR) algorithm. 

3. 1 Schwarz preconditioner 

The basic idea is to use the Schwarz procedure as a preconditioner rather than as a 
solver, and to apply the GCR algorithm to the preconditioned system. As discussed 
in the previous section, the Schwarz procedure can be seen as an operation acting 
on the source field, which produces an approximate solution of eq. (2.4), for any 
specified number n cy of cycles. If the associated linear operator is denoted by M sap , 
the right-preconditioned equation 



is thus expected to be better conditioned than the original linear system (once this 
equation is solved, the solution of the latter is obtained by setting ip = M sap </>). 
From eq. (2.11) it follows that M sap is given explicitly by 



In practice this formula is not used, however, and the calculation of (say) ip = M sap </> 
instead proceeds along the lines described in subsect. 2.3, i.e. by starting from ip = 



DM sap( f> = r] 



(3.1) 




(3.2) 
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and running through the blocks in the block grid, updating ip on each block, with <f> 
now being the source. 

3. 2 Outline of the GCR algorithm 

The GCR algorithm is a Krylov space solver that is mathematically equivalent to 
the GMRES algorithm (see ref. [2] for example). It offers some technical advantages 
with respect to the latter if the preconditioner is implemented approximately, as will 
be the case here. In the following a variant of the GCR algorithm is described that 
was introduced in the context of the so-called GMRESR algorithm [8,9]. 

Starting from tpo = 0, the GCR algorithm generates a sequence of approximate 
solutions ipk, k > 0, of the Wilson-Dirac equation (2.4) recursively. The associated 
residues are defined by 

p k = r]-Dtp k , (3.3) 

and the recursion then chooses ipk+i to be the field in the linear subspace spanned 
by Vi, • • ■ ,ipk,M sstp p k , which minimizes ||pfc+i||. 

It is important to understand that the field M sap pk merely defines the direction 
in which the previous subspace is extended. In particular, a precise implementation 
of the preconditioner is not really required, and its definition may even vary from 
step to step. The worst that can happen in this case is that the convergence of the 
algorithm slows down, but the algorithm will never become unstable or otherwise 
invalid. 

3.3 Definition of the GCR recursion 

The GCR algorithm generates two sequences of fields, Xk and that must be kept 
in the memory of the computer together with the last residue pk- A special feature 
of the chosen implementation is that the approximate solution ipk itself does not 
take part in the recursion and is only constructed at the end of the calculation. For 
reasons of efficiency the recursion may have to be restarted after a certain number of 
steps, taking the last residue as the new source. This point will be discussed again 
in sect. 4, together with other implementation details. 

Apart from the sequences of fields, the recursion generates sequences of complex 
numbers a^, bu and Cfc that should also be preserved. An explicit description of the 
algorithm is provided by the pseudo-code displayed in fig. 3. Note that the fields Xk 
satisfy 

{Xi,Xj) = $ij, 0<i,j<k, (3.4) 
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Po = V 

for k = 0,1,2, ... do 
£fc = M SEip p k 
Xk = D^ k 

for I = 0, . . . , k — 1 do 

azfc = (x/,Xfc) 
Xk = Xk ~ aikXi 
end do 

bk = llXfcll 
Xfc = Xfc/&fc 
Cfc = (Xk,Pk) 

Pk + l = Pk - CkXk 

end do 



Fig. 3. Pseudo-code for the GCR recursion. On the first line in the outer loop the 
Schwarz preconditioner is applied to the current residue pk ■ For the calculation of the 
next residue in the last two lines, the auxiliary field Xfc first needs to be constructed 
through a Gram-Schmidt orthogonalization process (inner loop). 

by the time pt+\ is calculated. The latter is then the projection of the first residue 
i] to the orthogonal complement of the space spanned by the fields Xoi ■ ■ ■ >Xk, which 
coincides with the space spanned by the fields D^ , . . . , D^k- In other words, 



for some complex coefficients a k chosen such that ||pfc+i|| is minimized. This shows 
that Pk+i is indeed the residue of the approximate solution ipk+i that was defined 
in subsect. 3.2, and the pseudo-code thus implements the GCR algorithm correctly. 

3.4 Calculation o/V'fc+i 

When it is decided to stop or restart the GCR recursion after k + 1 steps, the field 
V'fc+i will have to be computed. From the above it is obvious that 



k 




(3.5) 



A: 




(3.6) 
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but the coefficients ai [which are the same as in eq. (3.5)] are only implicitly known 
at this point. They can, however, be calculated by noting that the fields xi satisfy 



i-i 



= kxi + ^2auXi, I = 0, 1, . . . , k, 



(3.7) 



i=0 



and also 



k 



Pk+i = V ~^2 c iXi- 



(3.8) 



1=0 



Now if Pk+i =f] — Dipk+i is evaluated by first inserting eq. (3.6) and then eq. (3.7), 
the comparison with eq. (3.8) leads to the linear system 



which can be solved straightforwardly for the coefficients ai by back-substitution. 



4. Numerical implementation of the Schwarz procedure 

In this section some further details of the proposed algorithm are discussed, leaving 
away any purely programming issues. The Schwarz procedure is actually not entirely 
trivial to program, and a good choice of the data and program structures can be 
important (see appendix A for some suggestions). 

4-1 Block solver 

The first question to be addressed here is how to perform the inversion of the local 
Wilson-Dirac operator Da in the update step (2.8). Clearly an exact inversion is not 
possible in practice, and is in fact not needed, since an inaccurate implementation 
of the Schwarz preconditioner M sap is perfectly acceptable (cf. subsect. 3.2). 

Perhaps the simplest iterative solver that may be used in this context is the mi- 
nimal residual algorithm (ref. [2], §5.3.2). Independently of the value of the quark 
mass, only few iterations then are in general sufficient to reach the point where the 
rate of convergence of the GCR solver is not significantly improved by performing 



k 

bm+ ^2 auai = ci, I = k, k - 1, . . . ,0 

i=l+l 



(3.9) 
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further iterations. This is so in part because D\ tends to be well-conditioned, but it 
should also be noted that the Schwarz preconditioner is applied when the accuracy 
of the current approximate solution ipf. is anyway limited. 

The update step (2.8) takes the current residue p = r\ — Dtp as input and modifies 
the field ip on the block A. It is thus an operation that acts on these two globally 
defined fields in the following way: 

(a) The restriction of the residue p to the block A is copied to a block field p\. 
An approximate solution (\ of the block equation Z?aCa = Pa is then obtained by 
applying a fixed number n mr of minimal residual iterations. 

(b) if) is replaced by ip + (\ on A and is left untouched elsewhere. The residue p is 
also updated by adding /5a = Pa — -DaCa on A and subtracting Dg\*C\ on <9A. Note 
that the minimal residual algorithm returns both the approximate solution £a and 
the associated residue p\. 

Besides M sap r] this implementation of the Schwarz procedure yields DM SSLp r] essen- 
tially for free. There are two parameters, the number n cy of Schwarz cycles and the 
number n mr of minimal residual iterations performed in step (a), which should be 
adjusted so as to minimize the average computer time needed to obtain the solution 
of eq. (2.4) to the desired accuracy. 

4-2 Even- odd preconditioning 

If the Schwarz blocks are chosen to have an even number of points in all dimensions, 
even-odd preconditioning may be applied to accelerate the block solver. This simple 
modification is certainly worth the additional programming effort, since most of the 
computer time is spent in step (a). 

On the full lattice, the even-odd preconditioned form of the Wilson-Dirac equation 
(2.4) reads 



where the subscripts "e" and "o" refer to the even and the odd lattice sites. Equation 
(4.1) determines ijj e , while the solution on the odd points, 



Dtps = Ve- D eo (D OQ ) 1 rj, 



'Oj 



(4.1) 



D = D ee - D eo (D 00 )- 1 D t 



(4.2) 



ipo = (D 00 ) 1 (Vo - D oe ip e ) , 



(4.3) 



is obtained algebraically. 
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All these formulae carry over to the block equation £>aCa = Pa without any mod- 
ifications. In particular, the even-odd preconditioned Wilson-Dirac operator D\ is 
again given by eq. (4.2), with D replaced by D\. The minimal residual algorithm 
may then be used to solve the preconditioned system, in which case steps (a) and 
(b) together require the equivalent of n mr + 1 applications of D\ (if n mr iterations 
are performed) plus the calculation of about 2n mr scalar products and linear com- 
binations. 

4-3 Communication overhead 

On parallel computers each processor operates on a sublattice that is usually taken 
to be a rectangular block similar to the blocks on which the Schwarz procedure op- 
erates. In order to minimize the communication overhead, the local lattice should 
obviously be divisible by the Schwarz blocks. The block solver does not require any 
communications in this case, and it is only in step (b), when Dq\*Q\ is subtracted 
from the current residue, that some data need to be exchanged. In terms of "com- 
municated words per arithmetic operation" , a reduction by a factor n mr + 1 is thus 
achieved relative to the overhead in the program for the Wilson-Dirac operator. 

Another factor of 2 in the communication overhead can actually be saved by noting 
that 

(1 - n^)D dA *( A (x) = for all x € OA, (4.4) 

where n M is the outward normal vector to the boundary d A at x. There are thus two 
linear relations among the four Dirac components of Dq^* ("a {%) , and it is therefore 
sufficient to communicate the upper two components in the cases where a commu- 
nication is required, f 

Taking this into account, the update of the residue in step (b) proceeds by first 
adding /5a on all points in the block. The upper two components of Dqa*Ca are then 
calculated and stored in an auxiliary array. Some parts of this field may reside on the 
exterior boundary of the local lattice, in which case they have to be communicated 
to the appropriate neighbouring processes. Thereafter the full Dirac spinors -Dsa*Ca 
are reconstructed, using eq. (4.4), and subtracted from the residue. Note that the 
communication and subtraction can be delayed until all black (or all white) blocks 
have been visited. This allows the relevant parts of the fields on the boundaries of 
all the black (white) blocks to be transferred in a single data package per direction. 

I It is assumed here that the Dirac matrices are in a chiral representation, where 75 = 70717273 
is a diagonal matrix with entries 1,1,-1,-1 on the diagonal. The upper two components of the 
boundary field -Dqa'Ca are then guaranteed to be linearly independent. 



11 



4-4 Non-algorithmic accelerations 

Since the Schwarz preconditioner does not need to be accurately implemented, one 
may just as well use single-precision arithmetic in the program for the block solver. 
The GCR algorithm may also be coded in this way, with scalar products accumulated 
in double precision, provided it is restarted before the residue has decreased by more 
than a few orders of magnitude. Evidently, at each restart of the algorithm, the last 
residue should be recalculated using double-precision arithmetic to ensure that the 
solution of the Wilson-Dirac equation is obtained to full precision (see ref. [10] for 
a more detailed discussion of this point). 

All current PC processors support vector instruction sets (SSE, SSE2, 3DNow! or 
AltiVec) that allow short vectors of floating-point numbers to be processed simul- 
taneously [11]. The block solver and other time-critical parts of the program can be 
significantly accelerated using these additional capabilities, especially so if the fields 
on the Schwarz blocks fit into the cache of the processors. 



5. Tests of the algorithm 

As is generally the case for domain decomposition methods, the proposed algorithm 
is expected to perform particularly well on parallel computers. At this point nothing 
is known, however, about the convergence properties of the algorithm in the relevant 
range of parameters. The tests reported in this section serve to fill this gap and to 
show that an excellent parallel efficiency is indeed achieved on the computer where 
the tests were carried out (see appendix B for the technical specifications). 

5.1 Choice of parameters 

Different parameter sets have been considered, but most results were obtained for a 
particular choice, defined in the following lines, that will be referred to as the default 
parameter set. In general the strategy of the tests is to distribute the computations 
required for the solution of the Wilson-Dirac equation on a fixed lattice over an in- 
creasing number of processors, so that the effects of the communication overhead can 
be clearly seen. Taking this into account, and the fact that the available computer 
has 64 processors, it was decided to perform the tests on a 16 4 lattice. 

Samples of 100 statistically independent gauge field configurations were then gen- 
erated, using the Wilson plaquette action at (inverse) coupling (3 = 5.9, where the 
lattice spacing is about 0.10 fm. In each test run, five values of the bare quark mass 
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Table 1. Timings in [is [speed in Gflop/s] per lattice point and application of D 



number of 
processors 


local 
lattice 


(M sap V')32bit 


(£>^)64bit 


SAP+GCR 


BiCGstab 


1 


16 4 


0.64 [2.35] 


1.39 [0.98] 


0.75 


2.2 


2 


16 3 x 8 


0.86 [1.74] 


2.59 [0.52] 


1.12 


4.2 


4 


16 2 x 8 2 


0.86 [1.74] 


3.00 [0.45] 


1.10 


4.7 


8 


16 x 8 3 


0.88 [1.71] 


3.42 [0.40] 


1.11 


5.2 


16 


8 4 


0.90 [1.65] 


4.17 [0.33] 


1.13 


5.8 


32 


8 3 x 4 


0.94 [1.58] 


4.76 [0.28] 


1.24 


6.7 


64 


g2 x 4 2 


0.98 [1.53] 


5.50 [0.25] 


1.21 


7.3 



mo were considered, corresponding to values of k = (8 + 2mo) _1 equal to 0.1566, 
0.1574, 0.1583, 0.1589 and 0.1592. According to a recent large-scale study by the 
CP-PACS collaboration [12], the "pion" mass decreases from about 579 MeV to 
320 MeV in this range of quark masses. 

The sizes of the Schwarz blocks are constrained by the requirement that they 
should divide the local lattices. Larger blocks tend to yield a better preconditioner, 
but the block solver then consumes more processor time per lattice point. The exact 
choice is not critical, and a block size of 8 x 4 3 appears to be a good compromise for 
the given lattice and machine parameters. As for the remaining free parameters of 
the algorithm, some experimenting suggests to set n mr = 4, n cy = 5 and nk v = 16, 
where the latter limits the number of Krylov vectors that are generated by the GCR 
recursion before it is restarted. 

5.2 Reference algorithm 

To be able to judge the overall performance of the new algorithm, timings were also 
taken for the BiCGstab algorithm [13], which is probably the fastest Krylov space 
solver for the Wilson-Dirac equation [14]. The algorithm is applied to the even-odd 
preconditioned system (4.1), using a highly optimized code. In particular, as in the 
case of the Schwarz procedure discussed in subsect. 4.3, the program only transfers 
the upper two components of the Dirac spinors that need to be communicated, thus 
saving a factor of 2 in the communication overhead. 

A technical detail worth mentioning here is that single-precision acceleration [10] 
does not work too well in this case, because the increased arithmetic speed is practi- 
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»t x =579MeV 
I 

150 I 1 — I 



m =320MeV 



t [sec] 



100 



50 



BiCGstab 



SAP+GCR 



I , 1 , 1 , 1 , 1 , 1 

20 40 60 80 _i 

Fig. 4. Average execution time t needed to solve the Wilson-Dirac equation (2.4) 
as a function of the bare quark mass. A single processor is used here and the solvers 
are stopped when \\n — Dil>\\ < 10 _8 ||ry||. The lattice and algorithmic parameters are 
as specified in subsect. 5.1, and the dotted lines are drawn to guide the eye. 

cally compensated by the fact that more iterations are required to reach a specified 
precision. Moreover, there is a negative effect on the stability of the algorithm. The 
BiCGstab solver was therefore implemented in double-precision arithmetic, with all 
time-critical parts programmed using SSE2 vector instructions. 

5.3 Basic program and algorithm performance 

In terms of numbers of floating-point operations, the Schwarz preconditioner M sap 
is roughly equivalent to (n mr + l)n cy applications of D. The speed of the programs 
that implement the preconditioner and the complete solver (denoted "SAP+GCR" 
in the following) is therefore best measured in units of execution time per local 
lattice point and equivalent application of D. 

For the case where a single processor is used, some relevant performance figures 
are quoted in the first line of table 1. The associated Gflop/s rates, given in square 
brackets, refer to the actual number of floating-point operations. It should again be 
emphasized at this point that the use of single-precision arithmetic for the Schwarz 
preconditioner is entirely adequate and does not set a limit on the precision that can 
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t [sec] 




Fig. 5. Execution time required for the solution of the Wilson-Dirac equation (2.4) 
as a function of the number n p = 2, 4, . . . , 64 of processors used by the program. The 
parameters are as in table 1 and fig. 4, with the quark mass fixed to the smallest value 
considered there. Ideal parallel scaling is represented by the dashed lines. 



be reached by the SAP+GCR solver. Note that a BiCGstab iteration involves two 
applications of Z), so that the time per iteration and lattice point is twice the figure 
quoted in table 1. The linear algebra in the BiCGstab algorithm thus consumes more 
than a third of the total time spent by the single-processor program. This overhead 
is much smaller in the case of the SAP+GCR solver, because the application of the 
Schwarz preconditioner is relatively expensive in terms of computer time. 

In fig. 4 the time needed for the solution of the Wilson-Dirac equation to a speci- 
fied precision is plotted versus l/m q , where m q = m — m c denotes the additively 
renormalized quark mass and m c the critical bare mass. The SAP+GCR solver thus 
obtains the solution faster than the BiCGstab algorithm, by a factor ranging from 
1.6 to 1.9, if a single processor is used. When going from large to small quark masses, 
the curvature in the data suggests that the BiCGstab solver becomes relatively more 
efficient, but this trend is less pronounced on larger lattices and should therefore be 
taken finite- volume anomaly (cf. subsect. 5.5). 

Another result of these tests is that the GCR algorithm converges in comparatively 
few steps. On average only 17 Krylov vectors are generated at the largest quark mass 
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SAP+GCR 
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-1 



Fig. 6. Same as fig. 4, but now for a 48 x 24 3 lattice distributed over 16 processors. 
The sizes of the local lattices and the Schwarz blocks are 24 x 12 3 and 6 2 x 4 2 respec- 
tively, and rikv has been set to 24 in this case. All other parameters are unchanged. 

and 64 at the smallest mass, while 140 to 437 BiCGstab iterations are needed to 
obtain the solution to the specified precision. Since both algorithms have similar 
theoretical convergence properties [2], this suggests that the condition number of the 
Schwarz preconditioned Wilson-Dirac operator DM sap is roughly 15 times smaller 
than the condition number of D, at all quark masses that have been considered. 

5.4 Parallel efficiency 

Evidently these algorithmic properties are unchanged when the computational work 
is distributed over several processors, and the variations in the performance figures 
listed in table 1 thus provide a direct measure for the parallelization losses. The 
lattice and algorithmic parameters have been set to the default values in all these 
tests. In particular, the dimensionality of the parallelization and the communication 
overhead are increasing as one moves down the table. 

When going from 1 to 2 processors, the steep rise in the timings is, however, only 
partially due to the communication overhead. The principal cause for the slow-down 
instead is the fact that the two processors on the nodes of the computer share the 
available memory bandwidth (cf. appendix B). For this reason the parallel efficiency 
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of the programs should be judged by comparing the many-processor timings with the 
performance of the two-processor programs rather than the single-processor ones. 

The results quoted in table 1 show that the parallel efficiency of the programs 
for the Schwarz preconditioner and the SAP+GCR algorithm is very good. If the 
lattice is distributed over all 64 processors, for example, the surface-to-volume ratio 
of the local lattices is as large as 1.5, but the parallelization overhead in the program 
for the preconditioner nevertheless only consumes about 12% of the execution time. 
Moreover, this figure is even smaller in the case of the solver, because practically no 
communications are required in the linear algebra programs. Note incidentally that 
the computer achieves a total sustained speed of nearly 100 Gflop/s at this point. 

Given the available network bandwidth, it is no surprise, however, that the pro- 
gram for the even-odd preconditioned lattice Dirac operator and the BiCGstab solver 
do not scale as well. The difference is perhaps made clearer by fig. 5, where the time 
needed to solve the Wilson-Dirac equation is plotted as a function of the number of 
processors. In particular, if all 64 processors are used, and depending on the quark 
mass, the SAP+GCR solver is from 3.3 to 4.1 times faster than the BiCGstab solver. 

5.5 Larger lattices 

The timings taken on larger lattices suggest that the performance pattern remains 
practically the same as the one seen so far. Essentially the communication losses are 
determined by the surface-to-volume ratio of the local lattices, while the number of 
processors that are used appears to be less relevant. 

For illustration, the time needed to solve the Wilson-Dirac equation on a 48 x 24 3 
lattice is plotted in fig. 6. In this test 16 processors were used and the surface-to- 
volume ratio of the local lattices was only 0.6. The plot shows that the SAP+GCR 
and the BiCGstab algorithms have a similar scaling behaviour as a function of the 
quark mass, the SAP+GCR solver being faster by a factor ranging from 2.5 to 2.9. 
Moreover, the ratio of the iteration counts of the two algorithms is not very different 
from the one observed before, although the convergence is generally slower on this 
lattice than on the 16 4 lattice. 



6. Conclusions 

The algorithm presented in this paper provides the first example for a successful ap- 
plication of a domain decomposition method in lattice QCD. Traditionally the lattice 
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Dirac equation is solved using a general purpose algorithm such as the BiCGstab 
solver. Domain decomposition methods are fundamentally different from these, be- 
cause they do not operate in the usual Krylov spaces. The locality and ellipticity of 
the equation are instead exploited to construct a highly effective preconditioner. 

An obvious advantage of domain decomposition methods is that they are usually 
well suited for parallel processing. In the case of the algorithm described here, for 
example, the number of words per arithmetic operation which must be communi- 
cated is smaller than the communication rate in the BiCGstab algorithm by about 
a factor 5. This may not be an impressive figure, but it should be noted in this con- 
nection that a reduction of the communication overhead by a factor 2 alone allows 
16 times more processors to be used efficiently once the program is parallelized in 
four dimensions (and runs well in this mode). 

In terms of condition numbers, the non-overlapping Schwarz procedure that was 
considered in this paper yields an excellent preconditioner for the Wilson-Dirac 
operator. The numerical tests showed this very clearly, but another outcome of the 
tests was that the scaling behaviour of the SAP+GCR solver with respect to the 
quark mass is rather similar to the one of the BiCGstab algorithm. It is conceivable 
that more sophisticated domain decomposition methods can do better than this, and 
there are in fact quite a few cases in other branches of science, where such methods 
come close to having an ideal (i.e. constant) scaling behaviour [1]. For lattice QCD 
any progress in this direction would evidently be a major step forward. 

The numerical work reported in this paper was carried out on a PC-cluster at the 
Institut fur Theoretische Physik der Universitat Bern, which was funded in part by 
the Schweizerischer Nationalfonds. I wish to thank both institutions for the support 
given to this project and Peter Weisz for a critical reading of the manuscript. 



Appendix A. Programming issues 



The remarks collected in this appendix provide some guidelines for the programming 
of the Schwarz procedure. They are based on the experience gained in the course of 
this work, where all programs were written in C, using the standard message passing 
interface (MPI) for the communications between the processors. Evidently, the rec- 
ommended strategies may not be appropriate in other programming environments. 
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A.l Object orientation 

The Schwarz procedure is an obvious case for an object-oriented approach. A block 
of lattice points, for example, is best represented by a structure that contains the 
block fields and the local geometry arrays. There should be enough predefined infor- 
mation in this structure that the block solver can proceed without reference to the 
surrounding lattice. In particular, it is advantageous to have a copy of the relevant 
gauge field variables on each block. Other useful objects are block boundaries and 
the block grids on which the Schwarz procedure operates. 

Among the members of the block and boundary structures, the geometry arrays 
play a central role. Lattice points are usually labelled by an integer index ix in an 
arbitrary way. The embedding in the global lattice of a block point with local index 
ix thus amounts to specifying its global index imb [ix] . Similarly each boundary 
point with local index ix has a unique partner point in the block with block index 
ipp [ix] . This approach views blocks, boundaries and the global lattice as nothing 
more than ordered sets of points, whose geometry is encoded in the appropriate 
index arrays. 

Once the structures representing the objects have been defined, some utility pro- 
grams need to be written for object creation, destruction and other basic manipula- 
tions. In particular, since blocks have their own fields, there must be functions that 
copy the relevant parts of the global fields to and from a specified block. 

A. 2 Generic programs 

The programming effort as well as the program sizes can be significantly reduced by 
writing generic programs. Linear algebra routines, for example, only need to know 
the starting addresses of the spinor fields on which they operate and the number of 
spinors in the fields. The same programs can then be applied independently of the 
geometric context. 

In the case of the programs for the Wilson-Dirac operator, a generic code requires 
as input the addresses of the gauge field, of the source and the target spinor fields, 
and of the nearest-neighbour index array. Evidently the quark mass and the number 
of elements in the fields must also be specified, but apart from this the program does 
not need any further information to be able to proceed. Different local geometries 
and boundary conditions are then easily taken into account through predefined index 
arrays. 

A. 3 Parallelization 

If the global lattice is to be distributed over a large number of processors, it will be 
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preferable, in general, to parallelize the program in four dimensions. Communica- 
tions in eight directions are required in this case and some care should obviously be 
taken to ensure that the associated overhead remains small. 

Data transfers via MPI send and receive functions are characterized by the point- 
to-point bandwidth of the network and the start-up latency. The latter can be kept 
to a minimum by using so-called persistent communication requests and by aligning 
the data arrays in memory so that any unnecessary data packing is avoided. It may 
actually be quite important to follow this advice, since the data packages that are 
exchanged in the course of the Schwarz procedure tend to be relatively small. In 
particular, advantage should be taken of the chessboard colouring of the Schwarz 
blocks to maximize the package sizes, as explained at the end of subsect. 4.3. 



Appendix B. Machine parameters 

The tests reported in sect. 5 were performed on a commodity PC-cluster with 32 
nodes, each consisting of a motherboard with Intel E7500 chipset, two Intel Xeon 
processors (2.4 GHz, 512 KB cache), 1 GB of shared ECC DDR-200 memory and 
a Myrinet-2000 communication card. Fibre cables connect the Myrinet cards to a 
fast switch, which allows for arbitrary bidirectional point-to-point communications 
between the nodes. 

The tested programs are written in C with all communications (including intra- 
node data transfers) programmed using standard MPI functions. In most cases the 
best performance is obtained by having one process on each processor. In particular, 
if the program is written so that the two processors on the nodes send and receive 
data in an alternating fashion, the bidirectional bandwidth of the Myrinet network 
can then be saturated without any further programming effort. 

On the software side, the gcc compiler version 3.2.2 was used with optimization 
options -mcpu=i586 -malign-double -f no-force-mem -0. Following ref. [11], SSE 
and SSE2 vector instructions were included through embedded assembler statements. 
The programs were then linked to the MPICH-over-GM library provided by Myricom 
to enable the MPI functionality. 

In linear algebra programs (where the time required for the arithmetic operations 
is negligible), the effective total memory-to-processor bandwidth on each node is in 
the range from 1.6 to 2.4 GB/s. For bidirectional node-to-node communications and 
data packages larger than 1 MB, the network achieves a throughput of 160 MB/s in 
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each direction. If persistent communication requests are used (in MPI parlance), the 
effective bandwidth per link and direction is still 92 MB/s for packages as small as 
4 KB. Unidirectional data transfers are generally faster and reach about 240 MB/s. 
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